{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#  Sliceguard – Find critical data segments in your data (fast)\n",
    "## Mixed Data Walkthrough"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "🕹️ **[Interactive Demo](https://huggingface.co/spaces/renumics/sliceguard-mixed-data)**\n",
    "\n",
    "Sliceguard is a python library for quickly finding **critical data slices** like outliers, errors, or biases. It works on **structured** and **unstructured** data.\n",
    "\n",
    "This notebook showcases especially the **mixed** data case. If you are specifically interested in structured data or unstructured data analysis, please refer to the specific guides for **[structured data](./quickstart_structured_data.ipynb)** and **[unstructured data](./quickstart_unstructured_data.ipynb)** respectively.\n",
    "\n",
    "It is interesting for you if you want to do the following:\n",
    "1. Find **performance issues** of your machine learning model.\n",
    "2. Find **anomalies and inconsistencies** in your data.\n",
    "3. Quickly **explore** your data using an interactive report to generate **insights**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To run this notebook install and import sliceguard:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -U sliceguard[all]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sliceguard import SliceGuard\n",
    "from sliceguard.data import from_huggingface\n",
    "from sklearn.metrics import mean_squared_error\n",
    "from sklearn.svm import SVR"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now download the demo dataset from the huggingface hub:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = from_huggingface(\"alfredodeza/wine-ratings\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Subsample dataframe for quicker execution\n",
    "df = df.sample(2000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>region</th>\n",
       "      <th>variety</th>\n",
       "      <th>rating</th>\n",
       "      <th>notes</th>\n",
       "      <th>split</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>7602</th>\n",
       "      <td>Cadence Coda 2007</td>\n",
       "      <td>Red Mountain, Yakima Valley, Columbia Valley, ...</td>\n",
       "      <td>Red Wine</td>\n",
       "      <td>90.0</td>\n",
       "      <td>57% Merlot, 18% Cabernet Sauvignon, 13% Cabern...</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13365</th>\n",
       "      <td>Chateau Sociando-Mallet 2015</td>\n",
       "      <td>Haut Medoc, Bordeaux, France</td>\n",
       "      <td>Red Wine</td>\n",
       "      <td>90.0</td>\n",
       "      <td>The wines of Sociando-Mallet are characterized...</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>778</th>\n",
       "      <td>Alex Gambal Puligny-Montrachet Les Enseigneres...</td>\n",
       "      <td>Puligny-Montrachet, Cote de Beaune, Cote d'Or,...</td>\n",
       "      <td>White Wine</td>\n",
       "      <td>93.0</td>\n",
       "      <td>A powerful wine with a great balance of fruit ...</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20632</th>\n",
       "      <td>Dutschke Saint Jakobi Shiraz 1999</td>\n",
       "      <td>Spain</td>\n",
       "      <td>Red Wine</td>\n",
       "      <td>89.0</td>\n",
       "      <td>The 1999 St. Jakobi has rich mouth filling dar...</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24803</th>\n",
       "      <td>Gini Classico Superiore Soave 2005</td>\n",
       "      <td>Soave, Veneto, Italy</td>\n",
       "      <td>White Wine</td>\n",
       "      <td>90.0</td>\n",
       "      <td>Color: Straw color with green-gold reflections.</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1405</th>\n",
       "      <td>Amici Sonoma Coast Chardonnay 2016</td>\n",
       "      <td>Sonoma Coast, Sonoma County, California</td>\n",
       "      <td>White Wine</td>\n",
       "      <td>90.0</td>\n",
       "      <td>The 2016 Amici Cellars Chardonnay Sonoma Coast...</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28081</th>\n",
       "      <td>Jaboulet Condrieu Domaine des Grands Amandiers...</td>\n",
       "      <td>Condrieu, Rhone, France</td>\n",
       "      <td>White Wine</td>\n",
       "      <td>93.0</td>\n",
       "      <td>Viognier grapes never express their minerality...</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26969</th>\n",
       "      <td>Hess Collection Napa Valley Chardonnay 2015</td>\n",
       "      <td>Napa Valley, California</td>\n",
       "      <td>White Wine</td>\n",
       "      <td>90.0</td>\n",
       "      <td>Aged for 9 months in barrels and less stirred ...</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17042</th>\n",
       "      <td>Delas Les Launes Crozes Hermitage Rouge 2006</td>\n",
       "      <td>Rhone, France</td>\n",
       "      <td>Red Wine</td>\n",
       "      <td>91.0</td>\n",
       "      <td>The color is a deep garnet red. The nose is es...</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2563</th>\n",
       "      <td>Armanino The Whitewing Pinot Noir 2012</td>\n",
       "      <td>Russian River, Sonoma County, California</td>\n",
       "      <td>Red Wine</td>\n",
       "      <td>92.0</td>\n",
       "      <td>The Whitewing is something to behold. Every as...</td>\n",
       "      <td>train</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2000 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                    name  \\\n",
       "7602                                   Cadence Coda 2007   \n",
       "13365                       Chateau Sociando-Mallet 2015   \n",
       "778    Alex Gambal Puligny-Montrachet Les Enseigneres...   \n",
       "20632                  Dutschke Saint Jakobi Shiraz 1999   \n",
       "24803                 Gini Classico Superiore Soave 2005   \n",
       "...                                                  ...   \n",
       "1405                  Amici Sonoma Coast Chardonnay 2016   \n",
       "28081  Jaboulet Condrieu Domaine des Grands Amandiers...   \n",
       "26969        Hess Collection Napa Valley Chardonnay 2015   \n",
       "17042       Delas Les Launes Crozes Hermitage Rouge 2006   \n",
       "2563              Armanino The Whitewing Pinot Noir 2012   \n",
       "\n",
       "                                                  region     variety  rating  \\\n",
       "7602   Red Mountain, Yakima Valley, Columbia Valley, ...    Red Wine    90.0   \n",
       "13365                       Haut Medoc, Bordeaux, France    Red Wine    90.0   \n",
       "778    Puligny-Montrachet, Cote de Beaune, Cote d'Or,...  White Wine    93.0   \n",
       "20632                                              Spain    Red Wine    89.0   \n",
       "24803                               Soave, Veneto, Italy  White Wine    90.0   \n",
       "...                                                  ...         ...     ...   \n",
       "1405             Sonoma Coast, Sonoma County, California  White Wine    90.0   \n",
       "28081                            Condrieu, Rhone, France  White Wine    93.0   \n",
       "26969                            Napa Valley, California  White Wine    90.0   \n",
       "17042                                      Rhone, France    Red Wine    91.0   \n",
       "2563            Russian River, Sonoma County, California    Red Wine    92.0   \n",
       "\n",
       "                                                   notes  split  \n",
       "7602   57% Merlot, 18% Cabernet Sauvignon, 13% Cabern...  train  \n",
       "13365  The wines of Sociando-Mallet are characterized...  train  \n",
       "778    A powerful wine with a great balance of fruit ...  train  \n",
       "20632  The 1999 St. Jakobi has rich mouth filling dar...  train  \n",
       "24803   Color: Straw color with green-gold reflections.   train  \n",
       "...                                                  ...    ...  \n",
       "1405   The 2016 Amici Cellars Chardonnay Sonoma Coast...  train  \n",
       "28081  Viognier grapes never express their minerality...  train  \n",
       "26969  Aged for 9 months in barrels and less stirred ...  train  \n",
       "17042  The color is a deep garnet red. The nose is es...  train  \n",
       "2563   The Whitewing is something to behold. Every as...  train  \n",
       "\n",
       "[2000 rows x 6 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Show dataframe\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Check for data slices that are particulary different (Outliers/Errors in the data)\n",
    "Here sliceguard will train an **outlier detection** model to highlight data points that are especially different from the rest. Note that you can simply use **structured data** like the categorical variables *variety* and *region* in parallel to **unstructured data** like *notes* or *name*. Sliceguard will do embedding calculation and proper normalization internally. However, beware that often raw data and embeddings are way richer than a categorical field with only 5 unique values. This makes it much more likely sliceguard will find isolated clusters based on embeddings. You can however use the \"embedding_weights\" parameter. To lower the influence of specific embeddings manually.\n",
    "\n",
    "You can then use the **report feature** that uses [Renumics Spotlight](https://github.com/Renumics/spotlight) for visualization to dig into the reasons why a cluster is considered an outlier. For mixed data it can especially make sense to use the inspector view to visualize unstructured data in parallel to visualizing structured data by using Histograms, Scatterplots, and so on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sg = SliceGuard()\n",
    "issues = sg.find_issues(df, features=[\"notes\", \"variety\"], embedding_weights={\"notes\": 0.5}) # Play with the embedding weights parameter a bit. More fun in richer datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "_ = sg.report()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Check for data slices where models are prone to fail (hard samples, inconsistencies)\n",
    "Here sliceguard will **train a regression model** and check for data slices where the mse score is particulary bad. You will realize that in general for the model it is hard to determine the proper rating from the notes and variety. However, there are certain patterns you can uncover, especially **uninformative notes** such as \"Ex-chateau release\" that do not contain any information for generalizing on other data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Train the model and predict on the same data (of course in practice you will want to split your data!!!)\n",
    "# This is only for showing the principle\n",
    "sg = SliceGuard()\n",
    "issues = sg.find_issues(df,\n",
    "                        features=[\"notes\", \"variety\"],\n",
    "                        y=\"rating\",\n",
    "                        metric=mean_squared_error,\n",
    "                        automl_task=\"regression\",\n",
    "                       ) # also try out drop_reference=\"parent\" for more class-specific results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "_ = sg.report()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save embeddings for later use in own model example\n",
    "notes_embeddings = sg.embeddings[\"notes\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Check for weaknesses of your own model (...and hard samples + inconsistencies)\n",
    "This shows how to pass your **own model predictions** into sliceguard to find slices that are performing badly according to a supplied metric function. This allows you to uncover **inconsistencies** and samples that are **hard to learn** in no time!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Train the model and predict on the same data (of course in practice you will want to split your data!!!)\n",
    "# This is only for showing the principle\n",
    "clf = SVR()\n",
    "clf.fit(notes_embeddings, df[\"rating\"])\n",
    "df[\"predictions\"] = clf.predict(notes_embeddings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pass the predictions to sliceguard and uncover hard samples and inconsistencies.\n",
    "sg = SliceGuard()\n",
    "issues = sg.find_issues(df,\n",
    "                        features=[\"notes\"],\n",
    "                        y=\"rating\",\n",
    "                        y_pred=\"predictions\",\n",
    "                        metric=mean_squared_error,\n",
    "                        metric_mode=\"min\",\n",
    "                        precomputed_embeddings={\"notes\": notes_embeddings})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "_ = sg.report()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
